Finding and Typing New Named Entities in Tibetan from Chinese-Tibetan Parallel Corpora
Abstract
There is currently much interest in the automatic acquisition of entities for Named Entity Recognition (NER). However, previous work has focused primarily on major languages, which have large, structured, semantically rich knowledge bases and large corpora annotated with NER tags. In this paper, we describe a method for Chinese-Tibetan bilingual named entity recognition that uses an easily obtainable bilingual dictionary and parallel political corpora. The method has two distinct steps: the first identifies entity candidates in Tibetan, and the second assigns each candidate to a semantic class. We then test the approach on the dataset and analyze the NE typing errors.
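The abstract gives no implementation details, so the following is only a minimal sketch of how a two-step pipeline of this kind might look, not the authors' method. It assumes that the Chinese side is already tokenized and NE-tagged, that Tibetan tokens left unexplained by dictionary translations of non-entity Chinese words are entity candidates, and that a candidate inherits the Chinese entity type only when the sentence contains a single type. The function names, the tag set, and the `T_*` placeholder tokens (standing in for Tibetan words) are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def find_candidates(
    zh_tokens: List[str],
    zh_tags: List[str],
    bo_tokens: List[str],
    dictionary: Dict[str, Set[str]],
) -> List[str]:
    """Step 1 (assumed heuristic): Tibetan tokens that are not dictionary
    translations of any non-entity Chinese word form entity candidates.
    Contiguous unexplained tokens are merged into one candidate."""
    explained: Set[str] = set()
    for token, tag in zip(zh_tokens, zh_tags):
        if tag == "O":  # "O" marks a non-entity Chinese token
            explained.update(dictionary.get(token, set()))

    candidates: List[str] = []
    run: List[str] = []
    for token in bo_tokens:
        if token in explained:
            if run:
                candidates.append(" ".join(run))
                run = []
        else:
            run.append(token)
    if run:
        candidates.append(" ".join(run))
    return candidates


def type_candidates(
    candidates: List[str], zh_tags: List[str]
) -> List[Tuple[str, str]]:
    """Step 2 (assumed heuristic): project the Chinese entity type onto the
    Tibetan candidates when the sentence has exactly one entity type;
    otherwise leave the type unresolved as "UNK"."""
    types = {tag for tag in zh_tags if tag != "O"}
    label = next(iter(types)) if len(types) == 1 else "UNK"
    return [(candidate, label) for candidate in candidates]


# Toy aligned sentence pair; the T_* strings are placeholders, not real Tibetan.
zh_tokens = ["北京", "是", "首都"]
zh_tags = ["LOC", "O", "O"]
bo_tokens = ["T_beijing", "T_capital", "T_is"]
dictionary = {"是": {"T_is"}, "首都": {"T_capital"}}

candidates = find_candidates(zh_tokens, zh_tags, bo_tokens, dictionary)
typed = type_candidates(candidates, zh_tags)
print(typed)
```

Keeping the two steps as separate functions mirrors the abstract's split between candidate identification and typing, so that typing errors can be inspected independently of candidate errors.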
Similar resources
Tibetan Unknown Word Identification from News Corpora for Supporting Lexicon-based Tibetan Word Segmentation
In Tibetan, words are written consecutively without delimiters, so finding the boundaries of unknown words is difficult. This paper presents a hybrid approach to Tibetan unknown word identification for offline corpus processing. First, Tibetan named entities are preprocessed based on natural annotation. Second, other Tibetan unknown words are extracted from word segmentation fragments using MTC, the co...
Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features
Sentence-level alignment of bilingual parallel corpora plays a significant and indispensable role in machine translation, translation knowledge acquisition, and bilingual lexicography research, and is fundamental work for natural language processing. Given the great deal of work in sentence alignment and a variety of methods have developed for bilingual terminology extraction, those are...
Tibetan-Chinese Cross Language Text Similarity Calculation Based on LDA Topic Model
Topic model building is the basis and the most critical module of cross-language topic detection and tracking. Topic models can also be applied to cross-language text similarity calculation, where they improve the efficiency and speed of calculation by reducing the dimensionality of the texts. In this paper, we use the LDA model in cross-language text similarity computation to obtain Tibetan-Chinese c...
Clustering Research across Tibetan and Chinese Texts
Tibetan text clustering has potential in Tibetan information processing domain. In this paper, clustering research across Chinese and Tibetan texts is proposed to benefit Chinese and Tibetan machine translation and sentence alignment. A Tibetan and Chinese keyword table is the main way to implement the text clustering across these two languages. Improved Kmeans and improved density-based spatia...
Using Word Embeddings to Translate Named Entities
In this paper we investigate the usefulness of neural word embeddings in the process of translating Named Entities (NEs) from a resource-rich language to a language low on resources relevant to the task at hand, introducing a novel, yet simple way of obtaining bilingual word vectors. Inspired by observations in (Mikolov et al., 2013b), which show that training their word vector model on compara...